
Sr. Big Data Engineer Resume


Auburn Hills, MI

PROFESSIONAL SUMMARY:

  • Over 8 years of experience as a Big Data Engineer / Data Engineer and Data Analyst, including designing, developing, and implementing data models for enterprise-level applications and systems.
  • Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
  • Expert in Amazon EMR, S3, ECS, ElastiCache, DynamoDB, and Redshift.
  • Experience in installing, configuring, supporting, and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
  • Experience in dimensional data modeling, Star/Snowflake schemas, and FACT and dimension tables.
  • Experience with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, and Cloud SQL.
  • Experience working with NoSQL databases (HBase, Cassandra, and MongoDB), including database performance tuning and data modeling.
  • Expertise in writing Hadoop jobs to analyze data using MapReduce, Apache Crunch, Hive, Pig, and Splunk.
  • Good experience with analysis tools such as Tableau for regression analysis, pie charts, and bar graphs.
  • Extensive experience in technical consulting and end-to-end delivery with data modeling and data governance.
  • Implemented a distributed messaging queue integrated with Cassandra using Apache Kafka.
  • Excellent knowledge of Big Data infrastructure, distributed file systems (HDFS), and the parallel-processing MapReduce framework.
  • Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.
  • Experience with Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Good working knowledge of Apache NiFi as an ETL tool for batch and real-time processing.
  • Involved in writing SQL queries and PL/SQL programs; created new packages and procedures, and modified and tuned existing procedures and queries using TOAD.
  • Proficient in designing and implementing data structures and commonly used business intelligence tools for data analysis.
  • Extensive experience writing Storm topologies that accept events from a Kafka producer and emit them into Cassandra.
  • Excellent experience working with data modeling tools such as Erwin, PowerDesigner, and ER/Studio.
  • Proficient working experience with big data tools such as Hadoop, Azure Data Lake, and AWS Redshift.
  • Strong experience in data migration, data cleansing, transformation, integration, data import, and data export.
  • Excellent technical and analytical skills, with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Strong experience migrating data warehouses and databases onto Hadoop/NoSQL platforms.
  • Extensive experience using PL/SQL to write stored procedures, functions, and triggers.
  • Experience in data transformation, data mapping from source to target database schemas, and data cleansing procedures.
  • Extensive experience developing T-SQL and Oracle PL/SQL scripts, stored procedures, and triggers for business logic implementation.
  • Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
  • Designed and developed Oracle PL/SQL and shell scripts, data conversions, and data cleansing.
  • Experienced in working with scripting technologies such as Python and Unix shell scripts.
  • Good knowledge of Amazon Web Services (AWS) concepts such as EMR and EC2.

TECHNICAL SKILLS:

Big Data Tools: Hadoop Ecosystem, MapReduce, Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop 1.4, Kafka, Oozie

BI Tools: SSIS, SSRS, SSAS.

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX shell scripting.

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Google Cloud.

Databases: Oracle, Teradata, MySQL.

Operating Systems: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE:

Confidential, Auburn Hills, MI

Sr. Big Data Engineer

Responsibilities:

  • Architected, designed, and developed business applications and data marts for reporting.
  • Developed Big Data solutions focused on pattern matching and predictive modeling.
  • Designed the data schema and the project-relevant tables using SQL in Google BigQuery on Google Cloud Platform.
  • Created data ingestion processes to maintain a global data lake on GCP and BigQuery.
  • Developed Spark programs in Python, applying functional-programming principles to process complex structured data sets.
  • Monitored Dataproc clusters and jobs via the GCP Console; used Stackdriver to monitor dashboards, tuned and optimized memory-intensive jobs, and provided L3 support for applications in the production environment.
  • Developed ETL pipelines in and out of the data warehouse, and built major regulatory and financial reports using advanced SQL queries in Snowflake.
  • Loaded data from sources such as HDFS or HBase into Spark RDDs and implemented in-memory computation to generate the output response.
  • Built the logical and physical data models for Snowflake as changes required.
  • Developed complete end-to-end Big Data processing in the Hadoop ecosystem.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for dashboard reporting.
  • Maintained BigQuery, PySpark, and Hive code by fixing bugs and delivering enhancements requested by business users.
  • Created Airflow scheduling scripts in Python to automate Sqoop imports of a wide range of data sets.
  • Evaluated Snowflake design considerations for any change in the application.
  • Optimized PL/SQL queries to reduce the overall run time of stored procedures.
  • Configured and managed disaster recovery and backup for Cassandra data.
  • Used Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
  • Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2.
  • Automated feature-engineering mechanisms with Python scripts deployed on Google Cloud Platform (GCP) and BigQuery.
  • Created Hive external tables to stage data, then moved the data from staging to main tables.
  • Exported data from Hive tables into a Netezza database.
  • Implemented the Big Data solution using Hadoop, Hive, and Informatica to pull/load data into HDFS.
  • Set up multiple Snowflake pipelines and warehouses for various data consumers.
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
  • Used REST APIs with Python to ingest data from external sites into BigQuery.
  • Used best-practice methods to clean, manipulate, transform, and merge datasets in Python.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into an RDBMS through Sqoop.
  • The objective of this project was to build a cloud-based data lake in AWS using Apache Spark.
  • Prepared data models and schemas on GCP for different projects based on star and snowflake schema designs.
  • Downloaded BigQuery data into pandas or Spark data frames for advanced ETL capabilities.
  • Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of BigQuery usage.
  • Created BigQuery authorized views for row-level security and for exposing data to other teams.
  • Developed complex SQL queries, views, functions, and reports that meet customer requirements on Snowflake.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster data processing.
  • Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources, as part of a POC on Amazon EC2.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
  • Used Flume to collect, aggregate, and store web log data from sources such as web servers and mobile and network devices, and pushed it to HDFS.
  • Set up data reader accounts and micro-partition tuning across the Snowflake database for performance.
  • Scheduled jobs using Airflow and used Airflow hooks to connect to traditional databases such as DB2, Oracle, and Teradata.
  • Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive, and Sqoop.

Environment: GCP, Apache Spark, Hive, Informatica, HDFS, MapReduce, Scala, Apache NiFi, YARN, HBase, PL/SQL, MongoDB, Pig, Sqoop, Snowflake, Flume.
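Several bullets above lean on Hive partitioning and bucketing for faster queries. As a hedged, library-free illustration of the bucketing idea (not the engagement's actual code, which ran in Hive itself), each row's clustering key is hashed and taken modulo the bucket count, so equal keys always land in the same bucket:

```python
# Sketch of Hive-style bucketing: rows are assigned to buckets by hashing the
# clustering key, mimicking CLUSTERED BY (key) INTO N BUCKETS. Hive uses its
# own hash function; a stable stdlib CRC32 stands in here.
import zlib
from collections import defaultdict

def bucket_for(key: str, num_buckets: int) -> int:
    """Deterministically map a clustering key to a bucket id."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def bucketize(rows, key_col, num_buckets):
    """Group rows into buckets by their clustering key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_col], num_buckets)].append(row)
    return buckets

rows = [{"user": "alice", "spend": 10}, {"user": "bob", "spend": 7},
        {"user": "alice", "spend": 3}]
buckets = bucketize(rows, "user", 4)
```

Because equal keys co-locate, bucket-map joins and sampling can skip most of the data, which is what the tuned Hive UDF queries above exploit.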

Confidential, Cincinnati, Ohio

SR Data Engineer

Responsibilities:

  • Worked with the ETL team to document the transformation rules for data migration from the OLTP environment to the warehouse for reporting purposes.
  • Maintained and developed complex SQL queries, views, functions, and reports that meet customer requirements on Snowflake.
  • Ingested applications/files from one commercial VPC to OneLake.
  • Built EC2 instances, created IAM users and groups, and defined policies.
  • Created S3 buckets and applied bucket policies per client requirements.
  • Performed data wrangling to clean, transform, and reshape the data using the pandas library.
  • Analyzed data using SQL, Scala, Python, and Apache Spark, and presented analytical reports to management and technical teams.
  • Transformed data using AWS Glue dynamic frames with PySpark; cataloged the transformed data using Crawlers and scheduled the job and crawler with the workflow feature.
  • Helped business users minimize manual work by creating Python scripts (LDA sourcing, OneLake, SDP, S3, Databricks, Databench, Snowflake) to gather cloud metrics.
  • Automated the resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Used AWS SNS with Boto3 to send automated emails and messages after the nightly run.
  • Developed tools that automate AWS server provisioning and application deployments, and implemented basic failover among regions through AWS SDKs.
  • Provisioned AWS Lambda functions and EC2 instances, implemented security groups, and administered Amazon VPCs.
  • Installed and configured Apache Airflow for the S3 bucket and the Snowflake data warehouse, and created DAGs to run in Airflow.
  • Performed analysis, auditing, forecasting, programming, research, report generation, and software integration for an expert understanding of the end-to-end BI platform architecture supporting the deployed solution.
  • Analyzed incident, change, and job data from Snowflake and created a dependency-tree-based model of incident occurrence for every internal application service.
  • Developed and refined test plans to ensure successful project delivery, and employed performance analytics based on high-quality data to build reports and dashboards with actionable insights.
  • Migrated data from Teradata to Snowflake for consumption in Databricks.
  • Led development and implementation of several types of sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports using Tableau.
  • Produced parameterized sales performance reports every month and distributed them to the respective departments/clients using Tableau.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables; handled structured data using Spark SQL.
  • Created high-level and low-level design documents per business requirements, and worked with the offshore team to guide design and development.
  • Continuously monitored processes taking longer than expected to execute and tuned them.
  • Created Python (Boto3) scripts that integrate with the Amazon API to control instance operations.
  • Optimized existing pivot-table reports using Tableau and proposed an expanded set of views as interactive dashboards using line graphs, bar charts, heat maps, tree maps, trend analysis, Pareto charts, and bubble charts to enhance data analysis.
  • Wrote extensive PySpark SQL queries and validated them by comparing results against the legacy Teradata platform and the Snowflake cloud.
  • Monitored system life-cycle deliverables and activities to ensure that procedures and methodologies were followed and that complete documentation was captured.

Environment: Hive, Sqoop, Oozie, Python, Scala, Spark, Kafka, PySpark, MapReduce, Cassandra, Linux, AWS EMR, S3, Storm
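The wrangling bullet above covers cleaning, transforming, and reshaping data. A minimal plain-Python sketch of the typical cleansing steps (trim whitespace, cast numerics, drop rows missing the key, deduplicate); the project itself used pandas, and the field names here are hypothetical:

```python
# Hypothetical sketch of data-wrangling cleanup steps. In the engagement this
# was done with pandas; plain Python keeps the example self-contained.

def clean_records(records):
    seen = set()
    cleaned = []
    for rec in records:
        cust = (rec.get("customer") or "").strip()
        if not cust:                    # drop rows missing the key
            continue
        try:
            amount = float(rec.get("amount", ""))
        except ValueError:              # drop rows with unparseable amounts
            continue
        key = (cust, amount)
        if key in seen:                 # drop exact duplicates
            continue
        seen.add(key)
        cleaned.append({"customer": cust, "amount": amount})
    return cleaned

raw = [{"customer": " acme ", "amount": "10.5"},
       {"customer": "", "amount": "3"},
       {"customer": "acme", "amount": "10.5"},
       {"customer": "beta", "amount": "oops"}]
rows = clean_records(raw)
```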

Confidential, Minneapolis, MN

Data Engineer

Responsibilities:

  • Created and executed Hadoop ecosystem installation and configuration scripts on Google Cloud Platform and documented them.
  • Developed Python AWS serverless Lambda functions with concurrency and multi-threading to speed up processing by executing callables asynchronously.
  • Stored the resulting data set in S3 and Snowflake for visualization in Tableau reports.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.
  • Installed Airflow and created a database in PostgreSQL to store Airflow metadata.
  • Configured documents that allow Airflow to communicate with its PostgreSQL database.
  • Researched and downloaded JARs for Spark-Avro programming.
  • Ingested data from various data sources (Confidential DB, Confidential, Snowflake) into AWS S3 using the Spark-Scala JDBC connectors and Snowflake connectors.
  • Developed a PySpark program that writes dataframes to HDFS as Avro files.
  • Utilized Spark's parallel-processing capabilities to ingest data.
  • Created instances in AWS and worked on migration to AWS from the data center.
  • Responsible for distributed applications across hybrid AWS and physical data centers.
  • Wrote AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
  • Created data pipelines for different events to load data from DynamoDB to an AWS S3 bucket and then into an HDFS location.
  • Created AWS Lambda functions using Python for deployment management in AWS; designed, investigated, and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate dataframes from the Avro-formatted raw layer and writes them to data-service-layer internal tables in ORC format.
  • In charge of PySpark code creating dataframes from tables in the data service layer and writing them to a Hive data warehouse.
  • Developed Airflow DAGs in Python by importing the Airflow libraries.
  • Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline.

Environment: Hadoop, HDFS, Map Reduce, Hive, Pig, Flume, Oozie, HBase, Snowflake, Sqoop, RDBMS/DB, Flat files, MySQL, Java.
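The Airflow DAG bullets above boil down to declaring task dependencies and letting the scheduler run them in dependency order. A minimal sketch of that resolution using the stdlib's topological sorter; the task names are hypothetical, and a real DAG would of course use `airflow.models.DAG` with operators:

```python
# Sketch of the dependency resolution behind an Airflow DAG: tasks declare
# upstream dependencies, and the scheduler executes them in topological order.
# Hypothetical task names; real Airflow code would define operators on a DAG.
from graphlib import TopologicalSorter

# extract >> {transform, validate} >> load
deps = {
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}
order = list(TopologicalSorter(deps).static_order())
```

`transform` and `validate` have no mutual dependency, so Airflow is free to run them in parallel between `extract` and `load`.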

Confidential, Morrisville, NC

Big Data Hadoop Consultant

Responsibilities:

  • Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data.
  • Transformed data from mainframe tables to HDFS and HBase tables using Sqoop.
  • Loaded data into HBase using the HBase shell as well as the HBase client API.
  • Handled administration activities using Cloudera Manager.
  • Developed Impala scripts for extraction, transformation, and loading of data into the data warehouse.
  • Migrated on-prem Informatica ETL processes to the AWS cloud and Snowflake.
  • Collected and aggregated large amounts of web log data from different sources such as web servers and mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
  • Collected data from various Flume agents on various servers using multi-hop flow.
  • Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
  • Worked with Apache Solr for indexing and querying.
  • Created custom Solr query segments to optimize search matching.
  • Wrote optimized Pig scripts and developed and tested Pig Latin scripts.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
  • Reduced access time by refactoring data models and optimizing queries, and implemented a Redis cache to support Snowflake.
  • Ingested data into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs independently based on time and data availability.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Migrated tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
  • Created and maintained technical documentation for launching Hadoop clusters and executing Pig scripts.
  • Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.
  • Coordinated with the Scrum Master to deliver agreed user stories on time every sprint.

Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr.
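The incremental-import bullets above follow the check-column logic Sqoop applies in `--incremental append` mode: remember the high-water mark of a monotonically increasing column and fetch only rows beyond it. A hedged plain-Python sketch with hypothetical table data:

```python
# Sketch of Sqoop-style incremental append: track the max value of a check
# column (e.g. an auto-increment id) and pull only rows newer than it.
# Table contents and column names are illustrative.

def incremental_import(source_rows, check_col, last_value):
    """Return rows newer than last_value plus the new high-water mark."""
    new_rows = [r for r in source_rows if r[check_col] > last_value]
    new_last = max((r[check_col] for r in new_rows), default=last_value)
    return new_rows, new_last

table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
batch, watermark = incremental_import(table, "id", last_value=1)
```

Re-running with the returned watermark pulls nothing until new rows arrive, which is what makes scheduled incremental loads idempotent.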

Confidential

Python/Hadoop Consultant

Responsibilities:

  • Collected, aggregated, and moved data from servers to HDFS using Flume.
  • Created various Oozie jobs to manage processing workflows.
  • Created Oozie workflow and coordinator jobs to kick off jobs on time as data becomes available.
  • Created a dashboard for monitoring failures using Python Flask, with automatic ticket creation in BMC Remedy ITSM.
  • Wrote data into JSON files using Python.
  • Used Django database APIs to access database objects.
  • Used pandas, NumPy, Seaborn, matplotlib, scikit-learn, SciPy, and NLTK in Python for developing various machine learning algorithms.
  • Updated Python scripts to match data against our database stored in AWS CloudSearch, so each document could be assigned a response label for further classification.
  • Responsible for coding Java batch jobs, RESTful services, MapReduce programs, and Hive queries, plus testing, debugging, peer code review, troubleshooting, and status reporting.
  • Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
  • Developed Flume agents for loading and filtering streaming data into HDFS.
  • Handled continuous streaming data from different sources using Flume, with HDFS as the destination.
  • Worked on various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Wrote custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality.
  • Developed Pig scripts to store unstructured data in HDFS.
  • Developed Pig Latin scripts to extract and filter relevant data from web server output files for loading into HDFS.
  • Analyzed data by running Hive queries and Pig scripts to study customer behavior.
  • Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into HDFS and Pig to pre-process the data.
  • Developed job workflows in Oozie to automate loading data into HDFS and a few other Hive jobs.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Optimized MapReduce jobs to use HDFS efficiently via various compression mechanisms.

Environment: Hadoop, HDFS, Spark, HiveQL, Kafka, Pig, Airflow, Informatica, Oracle, PL/SQL, Sql Server, Oozie, UNIX, Linux, Shell Scripting.
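The custom MapReduce bullets above follow the classic map → shuffle → reduce pattern. A minimal word-count sketch in plain Python showing the three phases (the actual jobs described above were written in Java against the Hadoop API):

```python
# Word count expressed as map / shuffle / reduce, mirroring a Hadoop job:
# the mapper emits (word, 1) pairs, the shuffle groups values by key, and
# the reducer sums each group. Pure-Python stand-in for the Java jobs.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    return key, sum(values)

lines = ["Big Data big pipelines", "data data everywhere"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
```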

Confidential

Python Developer

Responsibilities:

  • Built automated SQL scripts that generate flight delay and cancellation predictions on a daily basis.
  • Used Python and packages such as pandas, NumPy, scikit-learn, and matplotlib for building statistical machine learning models and for data manipulation.
  • Helped prepare the business approach for solving the problem of travel ticket cancellations/extensions due to flight delays or flight cancellations.
  • Performed data preparation, feature selection (selecting features relevant to the problem), and outlier treatment.
  • Wrote Python libraries for interfacing with the Metric Insights RESTful API to extract content and publish to a Confidential group.
  • Coded in a Python (Linux, MySQL) environment.
  • Used flight delay predictions to identify the customers who might be affected.
  • Managed large datasets using pandas, the da - gcp package, and MySQL.
  • Used web services to get travel destination data and rates.

Environment: Python 3.x, Django, SQL, MySQL, Microsoft Excel, Google Cloud Platform, Windows, Linux
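The flight-delay prediction bullets above can be illustrated with the simplest possible baseline: predict the mean historical delay per route, falling back to the global mean for unseen routes. This is a hedged sketch with invented data, not the project's scikit-learn models:

```python
# Naive per-route baseline for flight-delay prediction: the mean historical
# delay of the route, else the global mean. Illustrative only; the actual
# project used scikit-learn models on real flight data.
from collections import defaultdict

def fit_baseline(history):
    """history: list of (route, delay_minutes) pairs."""
    by_route = defaultdict(list)
    for route, delay in history:
        by_route[route].append(delay)
    global_mean = sum(d for _, d in history) / len(history)
    means = {r: sum(ds) / len(ds) for r, ds in by_route.items()}
    def predict(route):
        return means.get(route, global_mean)
    return predict

history = [("ORD-LGA", 30), ("ORD-LGA", 10), ("SFO-SEA", 5)]
predict = fit_baseline(history)
```

A baseline like this gives the trained models something to beat when identifying customers likely to be affected by delays.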
