
Senior Big Data Engineer Resume


West Lake, TX

SUMMARY:

  • 8+ years of experience as a Big Data Engineer/Data Engineer and Hadoop Developer, including designing, developing, and implementing data models for enterprise-level applications and systems.
  • Experienced in managing Hadoop clusters and services using Cloudera Manager.
  • Experience in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL), and in using UDFs from the Piggybank UDF repository.
  • Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
  • Good experience with the Oozie framework and automating daily import jobs.
  • Experienced in troubleshooting errors in HBase shell/API, Pig, Hive, and MapReduce.
  • Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
  • Expertise in using various Hadoop components such as MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Flume, Drill, and Spark for data storage and analysis.
  • Implemented various analytics algorithms using Cassandra with Spark and Scala.
  • Experienced in creating Vizboards in Platfora for real-time dashboards and data visualization on Hadoop.
  • Collected log data from various sources and integrated it into HDFS using Flume.
  • Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases like Cassandra and MongoDB.
  • Extensively worked with AWS S3 and IAM features to load data into Snowflake.
  • Designed and implemented a product search service using Apache Solr.
  • Good understanding of Azure big data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Factory, and Azure Databricks; created a POC for moving data from flat files and SQL Server using U-SQL jobs.
  • Good knowledge of querying data from Cassandra for searching, grouping, and sorting.
  • Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and Spark to solve big data problems.
  • Ability to work effectively in cross-functional team environments, with excellent communication and interpersonal skills. Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala (a minimal PySpark sketch of this pattern appears after this list).
  • Expertise with big data on AWS cloud services, i.e. EC2, S3, Auto Scaling, Glue, Lambda, CloudWatch, CloudFormation, Athena, DynamoDB, and Redshift.
  • Developed custom Kafka producers and consumers for publishing to and subscribing from different Kafka topics.
  • Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka. Worked on reading multiple data formats from HDFS using Scala.
  • Creative in developing elegant solutions to pipeline engineering challenges.
  • Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
  • Extensive knowledge of reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters in Tableau. Experience working with Flume and NiFi for loading log files into Hadoop.
  • Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Worked in various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY, and Git.
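A minimal, illustrative sketch of the Hive-query-to-Spark-DataFrame pattern mentioned above, written in PySpark (the resume also cites Scala for this work); the database, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-dataframe")   # hypothetical app name
         .enableHiveSupport()            # allow reading Hive metastore tables
         .getOrCreate())

# Equivalent of:
#   SELECT region, SUM(amount) AS total_sales
#   FROM sales.orders
#   WHERE order_date >= '2020-01-01'
#   GROUP BY region
orders = spark.table("sales.orders")     # hypothetical Hive table

total_sales = (orders
               .filter(F.col("order_date") >= "2020-01-01")
               .groupBy("region")
               .agg(F.sum("amount").alias("total_sales")))

total_sales.show()
```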

TECHNICAL SKILLS:

Big Data Technologies: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, ZooKeeper, Sqoop, Oozie, Kafka, DataStax & Apache Cassandra, Drill, Flume, Snowflake, Spark, Solr and Avro.

Programming Languages: Scala, Python, SQL, Java, PL/SQL, Linux shell scripts

RDBMS: Oracle 10g/11g/12c, MySQL, SQL Server, Teradata, MS Access

NoSQL: HBase, Cassandra, MongoDB

Cloud Technologies: MS Azure and AWS

Web/Application servers: Tomcat, LDAP

Methodologies: Agile, UML, Design Patterns (Core Java and J2EE)

Tools Used: Eclipse, Putty, Cygwin, MS Office

BI Tools: Platfora, Tableau, Pentaho

WORK EXPERIENCE:

Senior Big Data Engineer

Confidential, West Lake, TX

Responsibilities:

  • Worked in an AWS environment on the development and deployment of custom Hadoop applications.
  • Worked extensively with Spark and MLlib to develop a regression model for cancer data.
  • Created Hive tables as per requirements, as internal or external tables, defined with appropriate static/dynamic partitions and bucketing for efficiency.
  • Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Used AWS services to manage applications in the cloud and to create or modify instances.
  • Extracted CSV and JSON data from AWS S3 and loaded it into the Snowflake cloud data warehouse.
  • Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline.
  • Involved in file movement between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Worked with both AWS and Azure services to manage cloud applications and to create or modify instances.
  • Loaded and transformed large sets of structured and semi-structured data using Hive.
  • Handled billions of log lines coming from several clients and analyzed them using big data technologies like Hadoop (HDFS), Apache Kafka, and Apache Storm.
  • Hands-on design and development of an application using Hive UDFs.
  • Developed simple to complex MapReduce streaming jobs in Python, implemented alongside Hive and Pig.
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Created Hive scripts for analyzing requirements and processing data, designing the cluster to handle large data volumes and to cross-examine data loaded by Hive and MapReduce jobs.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT gateways to ensure successful deployment of web application and database templates.
  • Created S3 buckets, managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Created a data pipeline to ingest, aggregate, and load consumer response data from an AWS S3 bucket into Hive external tables in HDFS, serving as a feed for Tableau dashboards.
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop.
  • Consumed XML messages from Kafka and processed them using Spark Streaming to capture UI updates.
  • Used Kafka to load data onto the Hadoop file system and move the same data to the Cassandra NoSQL database.
  • Worked on advanced Snowflake concepts such as virtual warehouse sizing, query performance tuning, data sharing, UDFs, zero-copy cloning, Time Travel, and data pipelines (Streams and Tasks).
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS (see the streaming sketch after this list).
  • Migrated existing MapReduce programs to Spark using Scala and Python.
  • Implemented Spark SQL to connect to Hive, read the data, and distribute processing for high scalability.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
  • Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team using Tableau.
  • Experience redesigning and migrating other data warehouses to the Snowflake data warehouse.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities on the network.
  • Wrote MapReduce code in Python to remove certain security issues in the data.
  • Synchronized both unstructured and structured data using Pig and Hive to meet business needs.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported the complete data from the RDBMS to the HDFS cluster using Sqoop.
  • Deployed the big data Hadoop application using Talend on AWS (Amazon Web Services) and also on Microsoft Azure.
  • Involved in designing and developing enhancements of CSG using AWS APIs.
  • After transformation, the data is moved to the Spark cluster, where it goes live to the application using Spark Streaming and Kafka.
  • Created and maintained various DevOps tools for the team, such as provisioning scripts, deployment tools, and development and staging environments on AWS and Rackspace Cloud.
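A minimal sketch of the Kafka-to-HDFS streaming flow referenced above, using PySpark Structured Streaming (the original work may have used Scala or the older DStream API); the broker addresses, topic name, and HDFS paths are hypothetical, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")
         .getOrCreate())

# Subscribe to a hypothetical Kafka topic of consumer-response events.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "consumer-response-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast to strings before persisting.
parsed = events.selectExpr("CAST(key AS STRING) AS key",
                           "CAST(value AS STRING) AS value",
                           "timestamp")

# Land the stream as Parquet files in HDFS to feed downstream Hive tables.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/consumer_response")
         .option("checkpointLocation", "hdfs:///checkpoints/consumer_response")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```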

Environment: HDFS, Hive, Scala, Sqoop, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Splunk, RDBMS, Elasticsearch, Kerberos, Jira, Confluence, Shell/Perl scripting, ZooKeeper, Snowflake, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes

Big Data Engineer

Confidential, Albany, NY

Responsibilities:

  • Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
  • Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.
  • Used the Avro serializer and deserializer for developing the Kafka clients.
  • Developed a Kafka consumer job to consume data from a Kafka topic and perform validations on the data before pushing it into Hive and Cassandra databases.
  • Applied Spark Streaming for real-time data transformation.
  • Worked with Spark using Scala and Spark SQL for faster testing and processing of data.
  • Applied advanced Spark procedures like text analytics and processing using in-memory processing.
  • Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
  • Implemented Apache Sentry to restrict access to Hive tables at a group level.
  • Employed the Avro format for all data ingestion for faster operation and lower space utilization.
  • Experienced in using Platfora, a data visualization tool specific to Hadoop, and created various Lenses and Vizboards for real-time visualization from Hive tables.
  • Created and implemented various shell scripts for automating jobs.
  • Deployed a Windows Kubernetes (K8s) cluster with Azure Container Service (ACS) from the Azure CLI, and utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy with Octopus Deploy.
  • Experience building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Migrated complex MapReduce programs to in-memory Spark processing using transformations and actions.
  • Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
  • Implemented a generic ETL framework with high availability for bringing related data into Hadoop and Cassandra from various sources using Spark.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
  • Exported the analyzed data to relational databases using Sqoop for visualization and report generation by the BI team using Tableau.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Implemented test scripts to support test-driven development and continuous integration.
  • Used Spark for parallel data processing and better performance.
  • Created an architecture stack blueprint for data access with the NoSQL database Cassandra.
  • Brought data from various sources into Hadoop and Cassandra using Kafka.
  • Experienced in using the Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
  • Created various types of data visualizations using Python and Tableau.
  • Created and deployed Azure command-line scripts to automate tasks.
  • Built pipelines to copy data from source to destination in Azure Data Factory.
  • Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
  • Implemented various data modeling techniques for Cassandra.
  • Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
  • Participated in various upgrade and troubleshooting activities across the enterprise.
  • Knowledge of performance troubleshooting and tuning of Hadoop clusters.
  • Worked on data cleaning and reshaping and generated segmented subsets using NumPy and Pandas in Python.
  • Worked with enterprise data support teams to install Hadoop updates, patches, and version upgrades as required, and fixed problems that arose after the upgrades.
  • Involved in creating notebooks for moving data from raw to staging and then to curated zones using Azure Databricks (see the notebook sketch after this list).
  • Used Azure SQL extensively for database needs in various applications.
  • Created multiple dashboards in Tableau for various business needs.
  • Installed and configured Hive, wrote Hive UDFs, and used Piggybank, a repository of UDFs for Pig Latin.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
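A minimal sketch of the raw-to-curated notebook pattern referenced above, as it might look in an Azure Databricks notebook (where a SparkSession named spark is predefined); the storage account, container names, and columns are hypothetical, and credential configuration for ADLS is omitted.

```python
from pyspark.sql import functions as F

# Hypothetical ADLS Gen2 locations for the raw and curated zones.
raw_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@examplelake.dfs.core.windows.net/sales/"

# Read the raw CSV drop as-is.
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(raw_path))

# Basic cleansing on the way to the curated zone: de-duplicate,
# drop rows missing the business key, and stamp the load date.
curated_df = (raw_df
              .dropDuplicates()
              .filter(F.col("order_id").isNotNull())   # hypothetical key column
              .withColumn("load_date", F.current_date()))

# Write partitioned Parquet for downstream consumption.
(curated_df.write
 .mode("overwrite")
 .partitionBy("load_date")
 .parquet(curated_path))
```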

Environment: MapR, Spark, Scala, Solr, Java, Azure SQL, Azure Databricks, MapReduce, Azure Data Lake, HDFS, Hive, Pig, Impala, Cassandra, Python, Kafka, Tableau, Teradata, CentOS, Pentaho, ZooKeeper, Sqoop.

Data Engineer

Confidential, Minneapolis, MN

Responsibilities:

  • Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
  • Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
  • Integrated Kafka with Spark Streaming for real-time data processing.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks in Tableau and SAS Visual Analytics.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology for the job at hand.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Developed code to handle exceptions and push the code into the exception Kafka topic.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Implemented and managed ETL solutions and automated operational processes.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
  • Created and maintained documents related to business processes, mapping design, data profiles and tools.
  • Applied various machine learning algorithms and statistical models such as decision trees, logistic regression, and gradient boosting machines to build predictive models using the scikit-learn package in Python (a minimal scikit-learn sketch follows this list).
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to extract, transform, and load data.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
  • Worked on big data with AWS cloud services, i.e. EC2, S3, EMR, and DynamoDB.
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Created ad-hoc queries and reports in SQL Server Reporting Services (SSRS) to support business decisions.
  • Analyzed the existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
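A minimal sketch of the predictive-modeling work referenced above, using the scikit-learn gradient boosting classifier named in that bullet; the input file, feature columns, and target column are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical modeling dataset with a binary target column "churned".
df = pd.read_csv("customer_churn.csv")
X = pd.get_dummies(df.drop(columns=["churned"]))
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```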

Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, MySQL, PostgreSQL.

Data Engineer

Confidential

Responsibilities:

  • Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Worked on AWS CLI Auto Scaling and CloudWatch monitoring creation and updates.
  • Imported and exported data between HDFS and databases using Sqoop.
  • Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest behavioral data into HDFS for analysis.
  • Implemented optimization and performance tuning in Hive and Pig.
  • Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
  • Designed and implemented a Cassandra NoSQL based database that persists high-volume user profile data.
  • Migrated high-volume OLTP transactions from Oracle to Cassandra
  • Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
  • Created a data pipeline of MapReduce programs using chained mappers.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for dashboard reporting (see the query sketch after this list).
  • Used Maven extensively for building JAR files of MapReduce programs and deploying them to the cluster.
  • Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS), and MongoDB.
  • Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using the MRUnit testing library.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
  • Modeled Hive partitions extensively for data separation and faster processing, and followed Pig and Hive best practices for tuning.
  • Loaded the aggregated data onto DB2 for reporting on the dashboard.
  • Used Pig as an ETL tool to do transformations, event joins, filters, and some pre-aggregations before storing the data in HDFS.
  • Created a customized BI tool for the manager team that performs query analytics using HiveQL.
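The bullet above describes computing metrics in Hive over partitioned, bucketed data. One way to run such a HiveQL query from Python is via the PyHive client (an assumption here, not something the resume names); the host, database, table, and column names are hypothetical, and HiveServer2 is assumed to be reachable.

```python
from pyhive import hive  # assumes PyHive is installed and HiveServer2 is running

# Hypothetical connection details and table/column names.
conn = hive.Connection(host="hiveserver.example.com", port=10000,
                       username="etl_user", database="analytics")
cursor = conn.cursor()

# Metrics over a partitioned, bucketed table; filtering on the partition
# column (dt) lets Hive prune partitions instead of scanning the full table.
cursor.execute("""
    SELECT region,
           COUNT(*)        AS events,
           AVG(latency_ms) AS avg_latency
    FROM   web_events
    WHERE  dt = '2015-06-01'
    GROUP BY region
""")

for region, events, avg_latency in cursor.fetchall():
    print(region, events, avg_latency)

cursor.close()
conn.close()
```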

Environment: RHEL, HDFS, MapReduce, AWS, Hive, Pig, Sqoop, Flume, Oozie, Mahout, HBase, Hortonworks Data Platform distribution, Cassandra.

Hadoop Developer

Confidential

Responsibilities:

  • Utilized Sqoop to import structured data from MySQL, SQL Server, and PostgreSQL, along with a semi-structured CSV dataset, into the HDFS data lake.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Experience creating and organizing HDFS over a staging area.
  • Troubleshot RSA SSH keys in Linux for authorization purposes.
  • Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was kept up to date for reporting purposes using Pig.
  • Performed data preprocessing and feature engineering for downstream predictive analytics using Python Pandas (a minimal pandas sketch follows this list).
  • Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
  • Created a data service layer of internal tables in Hive for data manipulation and organization.
  • Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data, which are also integrated with HBase.
  • Used Git for version control with colleagues.
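A minimal sketch of the pandas preprocessing and feature-engineering step referenced above; the input file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw extract; column names are illustrative only.
raw = pd.read_csv("customer_activity.csv", parse_dates=["signup_date"])

df = raw.drop_duplicates()

# Impute missing numeric values and keep a flag for the imputed rows.
df["monthly_spend_missing"] = df["monthly_spend"].isna()
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Simple date-derived and categorical features for downstream models.
df["signup_month"] = df["signup_date"].dt.month
df = pd.get_dummies(df, columns=["plan_type"], drop_first=True)

df.to_csv("customer_features.csv", index=False)
```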

Environment: Hadoop, Hive, HBase, Spark, Python, Pandas, PL/SQL, MySQL, SQL Server, PostgreSQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP, Git.
